Transient noise appearing in data from gravitational-wave detectors frequently causes problems, such as detector instability and overlapping with or mimicking gravitational-wave signals. Because transient noise is considered to be associated with the environment and the instruments, its classification would help to understand its origin and improve detector performance. In a previous study, an architecture for classifying transient noise using time-frequency 2D images (spectrograms) was proposed that combines unsupervised deep learning with a variational autoencoder and invariant information clustering. The proposed unsupervised-learning architecture was applied to the Gravity Spy dataset, which consists of transient noise from the Advanced Laser Interferometer Gravitational-Wave Observatory (Advanced LIGO) together with its associated metadata, to discuss its potential for online or offline data analysis. In this study, focusing on the Gravity Spy dataset, the training process of the previously proposed unsupervised-learning architecture is examined and reported.
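The clustering half of the architecture above can be illustrated with a toy version of the invariant-information-clustering objective. This is a minimal NumPy sketch, not the authors' implementation: the paired soft assignments would in practice come from a VAE encoder applied to two views of the same spectrogram, which is assumed here.

```python
import numpy as np

def iic_loss(p, p_prime, eps=1e-10):
    """Invariant-information-clustering objective for paired soft cluster
    assignments p, p' of shape (batch, k), e.g. two views of the same
    spectrogram passed through an encoder and a softmax head.
    Returns -I(z; z') for the joint P = E[p p'^T] (lower is better)."""
    P = (p[:, :, None] * p_prime[:, None, :]).mean(axis=0)
    P = (P + P.T) / 2.0                   # symmetrize the joint distribution
    Pi = P.sum(axis=1, keepdims=True)     # marginal over the first view
    Pj = P.sum(axis=0, keepdims=True)     # marginal over the second view
    mi = (P * (np.log(P + eps) - np.log(Pi + eps) - np.log(Pj + eps))).sum()
    return float(-mi)

# Perfectly confident, consistent assignments over k=2 clusters reach the
# optimum -log(2); uniform assignments carry no information (loss ~ 0).
confident = np.eye(2)
uniform = np.full((4, 2), 0.5)
```

Minimizing this loss pushes the model toward confident cluster assignments that are invariant across views, which is what lets the noise classes emerge without labels.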
This study demonstrates the feasibility of point cloud-based proactive link quality prediction for millimeter-wave (mmWave) communications. Previous studies have proposed image-based methods that use machine learning on time series of depth images to quantitatively and deterministically predict future received signal strength, mitigating line-of-sight (LOS) path blockage by the human body in mmWave communications. However, image-based methods have been limited in their applicable environments because camera images may contain private information. This study therefore demonstrates the feasibility of using point clouds obtained from light detection and ranging (LiDAR) for mmWave link quality prediction. Point clouds represent three-dimensional (3D) spaces as a set of points and are sparser and less likely to contain sensitive information than camera images. Additionally, point clouds provide the 3D position and motion information necessary for understanding a radio propagation environment involving pedestrians. We design a point cloud-based mmWave link quality prediction method and conduct two experimental evaluations using different types of point clouds, obtained from LiDAR and depth cameras, and different numerical indicators of link quality, namely received signal strength and throughput. These experiments show that the proposed method can predict future large attenuation of mmWave link quality caused by LOS blockage by human bodies; our point cloud-based method can therefore serve as an alternative to image-based methods.
The task of out-of-distribution (OOD) detection is vital to realizing safe and reliable operation in real-world applications. After likelihood-based detection was shown to fail in high dimensions, approaches based on the \emph{typical set} have attracted attention; however, they still have not achieved satisfactory performance. Starting from a failure case of the typicality-based approach, we propose a new reconstruction error-based approach that employs normalizing flow (NF). We further introduce a typicality-based penalty, and by incorporating it into the reconstruction error in NF, we propose a new OOD detection method, the penalized reconstruction error (PRE). Because PRE detects test inputs that lie off the in-distribution manifold, it effectively detects adversarial examples as well as OOD examples. We show the effectiveness of our method through evaluations on the natural image datasets CIFAR-10, TinyImageNet, and ILSVRC2012.
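The combination of a flow's reconstruction error with a typicality notion can be illustrated with a toy score: map an input to latent space, project the latent code onto the Gaussian typical shell ||z|| ≈ sqrt(d), invert, and measure the reconstruction error. This is only a minimal sketch of the idea, not the paper's PRE definition; the element-wise affine "flow" stands in for a trained normalizing flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy invertible "flow": an element-wise affine map standing in for a trained
# normalizing flow; in-distribution data is assumed ~ N(0, 1) per dimension.
mu, sigma = 0.0, 1.0
f = lambda x: (x - mu) / sigma          # data -> latent
f_inv = lambda z: z * sigma + mu        # latent -> data

def pre_score(x):
    """Penalized-reconstruction-error-style OOD score (illustrative).

    Project the latent code onto the typical shell ||z|| = sqrt(d), map it
    back through the inverse flow, and return the reconstruction error.
    In-distribution inputs already lie near the shell, so their error is small.
    """
    z = f(x)
    d = z.size
    z_typ = z * (np.sqrt(d) / (np.linalg.norm(z) + 1e-12))  # typicality projection
    return float(np.linalg.norm(x - f_inv(z_typ)))

x_in = rng.standard_normal(256)          # looks like the training distribution
x_out = 5.0 * rng.standard_normal(256)   # inflated variance: off-manifold
```

Under these assumptions, `x_out` lands far from the typical shell and receives a much larger score than `x_in`.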
Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
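The query-conditioning step described above can be sketched with a toy mask estimator modulated by the query vector. This is a minimal NumPy illustration under stated assumptions: the random weights, the FiLM-style modulation, and the stand-in query encoder are all hypothetical; a real system would use the pretrained CLIP towers and a learned separation network.

```python
import numpy as np

rng = np.random.default_rng(1)
EMB = 512   # CLIP-style embedding size
MEL = 64    # mel bins of the mixture spectrogram

# Random stand-in weights (a trained model would learn these).
W_clip = rng.standard_normal((EMB, EMB)) * 0.05    # stand-in query encoder
W_gamma = rng.standard_normal((MEL, EMB)) * 0.05   # FiLM scale head
W_beta = rng.standard_normal((MEL, EMB)) * 0.05    # FiLM shift head

def encode_query(raw):
    """Stand-in for CLIP: at training time the query vector would come from
    the image tower, at test time from the text tower (zero-shot)."""
    v = np.tanh(W_clip @ raw)
    return v / np.linalg.norm(v)

def separate(mixture_spec, query_vec):
    """Condition a toy mask estimator on the query via FiLM-style modulation,
    then apply the sigmoid mask to the mixture spectrogram."""
    gamma, beta = W_gamma @ query_vec, W_beta @ query_vec
    logits = gamma[:, None] * mixture_spec + beta[:, None]
    mask = 1.0 / (1.0 + np.exp(-logits))   # per-bin mask in (0, 1)
    return mask * mixture_spec

mixture = np.abs(rng.standard_normal((MEL, 10)))   # toy magnitude spectrogram
query = encode_query(rng.standard_normal(EMB))     # toy raw query feature
estimate = separate(mixture, query)
```

The key design point carried over from the abstract is that the separation network never sees the query modality directly, only the shared embedding, which is what makes the image-to-text swap at test time possible.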
The demand for resilient logistics networks has increased because of recent disasters. In optimization problems, entropy regularization is a powerful tool for diversifying a solution. In this study, we propose a method for designing a resilient logistics network based on entropy regularization, together with analytical resilience criteria that reduce the ambiguity of the notion of resilience. First, we model a logistics network, including factories, distribution bases, and sales outlets, in an efficient framework using entropy regularization. Next, we formulate a resilience criterion based on probabilistic cost and the Kullback--Leibler divergence. Finally, we apply our method to a simple logistics network and demonstrate the resilience of three logistics plans designed by entropy regularization.
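The diversifying effect of entropy regularization admits a closed-form toy illustration: minimizing expected cost ⟨p, c⟩ − γH(p) over the probability simplex yields a softmax plan, and KL divergence can compare plans. A minimal NumPy sketch follows; the cost values, the single cost vector, and the use of KL as a plan "distance" are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def entropy_regularized_plan(costs, gamma):
    """Minimize <p, c> - gamma * H(p) over the probability simplex.

    The closed-form minimizer is the softmax p_i ∝ exp(-c_i / gamma);
    a larger gamma spreads flow over more routes (diversification).
    """
    logits = -np.asarray(costs, dtype=float) / gamma
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), used here as a rough 'distance' between logistics plans."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

# Toy example: three routes with different costs.
costs = [1.0, 2.0, 5.0]
sharp = entropy_regularized_plan(costs, gamma=0.1)    # concentrates on the cheapest route
diverse = entropy_regularized_plan(costs, gamma=5.0)  # spreads flow across routes
```

The `diverse` plan pays more in expectation but survives the loss of its cheapest route, which is the efficiency-resilience trade-off the abstract describes.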
Telework in the form of "avatar work," in which people with disabilities can engage in physical work such as customer service, is being introduced in society. To enable avatar work in a variety of occupations, we propose a mobile sales system that combines a mobile frozen-drink machine with the avatar robot "OriHime," focusing on mobile customer service such as peddling. The effect of peddling with the system on customers is examined based on the results of video annotation.
This paper addresses recipe generation from unsegmented cooking videos, a task that requires an agent to (1) extract the key events in completing a dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims to detect events thoroughly and generate sentences for them. However, unlike in DVC, recipe story awareness is crucial in recipe generation: a model should output an appropriate number of key events in the correct order. We analyzed the outputs of DVC models and observed that, although (1) several events are adoptable as a recipe story, (2) the generated sentences for such events are not grounded in the visual content. Based on this, we hypothesize that a correct recipe can be obtained by selecting oracle events from the output events of a DVC model and re-generating sentences for them. To achieve this, we propose a novel Transformer-based joint approach that trains an event selector and a sentence generator, which respectively select oracle events from the outputs of a DVC model and generate grounded sentences for those events. In addition, we extend the model to include ingredients, yielding more accurate recipes. Experimental results show that the proposed method outperforms state-of-the-art DVC models. We also confirm that, by modeling recipes in a story-aware manner, the proposed model outputs an appropriate number of events in the correct order.
We present a new multimodal dataset called "Visual Recipe Flow," which enables us to learn the result of each cooking action. The dataset consists of object state changes and the workflow of the recipe text. The state changes are represented as image pairs, while the workflow is represented as a recipe flow graph (R-FG). The image pairs are grounded in the R-FG, which provides cross-modal relations. With our dataset, one can try a range of applications, from multimodal commonsense reasoning to procedural text generation.
In this paper, we propose a model to perform speech-to-singing voice conversion. In contrast to previous signal-processing-based methods, which require high-quality singing templates or phoneme synchronization, we explore a data-driven approach to the problem of converting natural speech into singing voice. We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody while preserving the speaker identity and naturalness. The proposed SymNet model consists of symmetrical stacks of three types of layers: convolutional, Transformer, and self-attention layers. The paper also explores novel data augmentation and generative loss annealing methods to facilitate model training. Experiments are conducted on the NUS and NHSS datasets, which consist of parallel data of speech and singing voice. In these experiments, we show that the proposed SymNet model significantly improves the objective reconstruction quality over previously published methods and baseline architectures. Furthermore, a subjective listening test confirms the improvement in audio quality obtained with the proposed approach (an absolute improvement of 0.37 in the mean opinion score measure over the baseline system).
We present UnrealEgo, a new large-scale naturalistic dataset for egocentric 3D human pose estimation. UnrealEgo is based on an advanced concept of eyeglasses equipped with two fisheye cameras that can be used in unconstrained environments. We design their virtual prototypes and attach them to 3D human models for stereo view capture. We next generate a large corpus of human motions. As a consequence, UnrealEgo is the first dataset to provide in-the-wild stereo images with the largest variety of motions among existing egocentric datasets. Furthermore, we propose a new benchmark method, whose simple but effective idea is to devise a 2D keypoint estimation module for stereo inputs to improve 3D human pose estimation. Extensive experiments show that our method outperforms the previous state-of-the-art methods both qualitatively and quantitatively. UnrealEgo and our source code are available on our project web page.